Olmo 3

mentions 1 · type Person

// recent coverage (1 mention)

19:45 · 2026-06-14 · lesswrong.com · ai-safety

Why Do Naive SFT Filters For Safety Properties Fail?

Google DeepMind researchers investigate why filtering supervised fine-tuning (SFT) data fails to remove safety-relevant properties from language models, proposing a method to identify the source of th…

// top 3 co-occurring entities

Google DeepMind (1) · Gemini 3 Flash (1) · MATS (1)